Abstract


Spatial  data  appear  in  numerous  applications,  such  as  GIS  [8,  9,  18],  multimedia  [6]  and  even  traditional 
databases.  Most  of  the  analysis  has  focused  on  point  data,  typically  using  the  uniformity  assumption, 
or,  more  accurately,  a  fractal  distribution  [5].  However,  no  results  exist  for  non-point  spatial  data,  like 
2-d  regions  (e.g.,  islands),  3-d  volumes  (e.g.,  physical  objects  in  the  real  world)  etc. 

This  is  exactly  the  problem  we  solve  in  this  paper.  Based  on  experimental  evidence  that  real  areas 
and  volumes  follow  a  “power  law”,  that  we  named  REGAL  (REGion  Area  Law),  we  show  (a)  the 
theoretical  implications  of  our  model  and  its  connection  with  the  ubiquitous  fractals  and  (b)  the  first  of 
its  practical  uses,  namely  the  selectivity  estimation  for  range  queries.  Experiments  on  a  variety  of  real 
datasets  (islands,  lakes,  human-inhabited  areas)  show  that  our  method  is  extremely  accurate,  enjoying  a 
maximum  relative  error  ranging  from  1  to  5%,  versus  30-70%  of  a  naive  model  that  uses  the  uniformity 
assumption. 


1  Introduction 


Spatial  data  appear  in  numerous  applications,  such  as  GIS  [8,  9,  18],  multimedia  [6]  and  spatiotemporal 
databases  [19].  Statistical  modeling  of  real  data  involves  the  concise  description  of  a  dataset  with  a 
few  parameters  (e.g.,  total  count,  area,  etc.),  so  that  we  can  obtain  accurate  estimates.  Such  a  concise 
description  is  useful  for  at  least  three  settings: 

•  selectivity  for  range  queries,  k  nearest  neighbor  queries,  spatial  joins  etc. 

•  analysis  of  spatial  access  methods  (SAM).  For  example,  how  many  nodes  will  an  R-tree  or  quadtree 
require  to  store  the  data,  how  many  such  nodes  a  query  will  touch,  etc. 

•  generation  of  pseudo-random,  but  realistic,  spatial  datasets,  that  can  stress-test  SAMs,  whenever 
real  data  are  not  available.  For  example,  for  scale-up  studies,  or  for  studies  on  high  dimensionalities 
[2],  where  we  want  to  control  the  statistics,  to  do,  e.g.,  sensitivity  analysis. 

Most  of  the  analysis  efforts  have  focused  on  point  data,  typically  using  the  uniformity  assumption, 
or,  more  accurately,  a  fractal  distribution  [5].  In  fact,  for  point  data,  these  two  numbers  (the  count  and 
the  fractal  dimension  of  a  dataset)  are  sufficient  to  accurately  estimate  selectivities  for  range  queries, 
spatial  joins  and  nearest  neighbor  queries,  as  we  describe  in  the  survey  section. 

However,  no  results  exist  for  non-point  spatial  data,  like  2-d  regions  (islands,  lakes,  vegetation 
patches) and 3-d volumes (physical objects in the real world). We shall refer to such data that have non-zero
d-dimensional volume as region data, although the upcoming discussion holds for any dimensionality
of  the  address  space.  Thus,  the  problem  we  focus  on  is  the  following:  We  are  given  a  real  set  of  region 
data  (normalized  on  the  unit  d-dimensional  hypercube);  what  is  the  smallest  number  of  parameters  that 
we  need  to  describe  it? 

For  point  data,  the  count  and  the  fractal  dimension  are  enough.  However,  for  region  data,  it  is 
not  even  clear  what  we  mean  by  fractal  dimension:  d-dimensional  region  objects  have  fractal  dimension 
equal  to  d.  Maybe  we  want  the  fractal  dimension  of  the  centers  of  our  region  data?  Or  maybe  something 
else? 

We  answer  all  these  questions  in  the  rest  of  the  paper,  developing  a  realistic  statistical  model  for 
region  data,  and  showing  how  to  use  it  to  compute  the  selectivity  of  range  queries.  Its  maximum  relative 
error  ranges  from  1  to  5%,  versus  30-70%  of  the  naive  model,  based  on  the  uniformity  assumption. 

The paper is organized as follows: Section 2 gives a brief description of previous work on the topic. In
Section 3 we develop our model and show how it can be used to estimate the selectivity of window queries
on region datasets in d-dimensional space. Section 4 provides a large collection of experimental
results on real region data (collections of islands, lakes, urban areas, etc.). Section 5 presents the
relationship between our model and fractal theory, exploits it to provide a realistic random
region generator, and suggests some directions for a practitioner for an effective application of our model.
Finally, Section 6 contains concluding remarks and future work.

2  Survey 

The main topic within the spatial database field which is related to our present work is query
optimization and, more specifically, selectivity estimation of range (or window) queries, which are the most
popular spatial access operation [15, 13]. In the database community, there is mounting evidence
that query optimization is becoming more and more important with the advent of spatial databases
consisting of petabytes of data and, in the near future, of huge spatiotemporal databases [4].

In [11, 15], an analytical formula to compute selectivity for a window query as a function of the
underlying data morphology and distribution is given. To apply such a formula when these parameters
are unknown, one typically makes the uniformity and independence assumptions on them. Unfortunately,
these assumptions do not hold for real datasets and generally lead to pessimistic results [3]. Recently, the
introduction of the concept of fractal dimension has made it possible to describe the statistical properties
of the data better and to analyze precisely the space and time performance of the spatial data structures
used to store them. Using the fractal dimension, we can accurately estimate the performance of R-trees
for range queries [5], the selectivity of spatial joins [1] and the performance of nearest-neighbor queries
[16]. However, all these works focus on point data only. Therefore, to the best of our knowledge, this is
the first attempt to model region datasets accurately.

3  Proposed  method 

In this section, we first give the problem definition and show a naive solution. Then, we give the
proposed solution, for d = 2 dimensions and for arbitrary d. Table 1 gives a list of symbols used
throughout this section.

3.1  Problem  definition 

Let us rigorously state the problem we are concerned with. For the sake of clarity, let us first focus on
the 2-dimensional space. Later, all the results will be extended to the d-dimensional space.

PROBLEM: selectivity of rectangles
Given:

• A set of similar rectangles (i.e., having a fixed given aspect ratio $p$ between width and height)
$\mathcal{R} = \{r_1, r_2, \ldots, r_N\}$ embedded in $U = [0,1] \times [0,1]$.
Symbol                  Definition
$\mathcal{R}$           Dataset of rectangles
$N$                     Total number of rectangles of $\mathcal{R}$
$A$                     Total area of $\mathcal{R}$
$W$                     Total width of $\mathcal{R}$
$H$                     Total height of $\mathcal{R}$
$p$                     Fixed ratio between width and height of rectangles in $\mathcal{R}$
$a_{max}$               Area of the largest rectangle in $\mathcal{R}$
$w_{max}$               Width of the largest rectangle in $\mathcal{R}$
$h_{max}$               Height of the largest rectangle in $\mathcal{R}$
$B$                     Patchiness exponent
$C(a)$                  Number of regions having area at least $a$
$C_W(w)$                Number of regions having width at least $w$
$C_H(h)$                Number of regions having height at least $h$
$q$                     Query window of sides $q_1, \ldots, q_d$
$Sel(\mathcal{R}, q)$   Avg. selectivity for range queries of sides $q_1, \ldots, q_d$
$U$                     Image space

Table 1: Symbol table


• The total area $A$ of $\mathcal{R}$.

• A $q_x \times q_y$ window query $q$.

Find the selectivity $Sel(\mathcal{R}, q)$ in $\mathcal{R}$ of the window query $q$, that is, the number of rectangles in $\mathcal{R}$
intersecting $q$.

The formula in [11, 15] gives the selectivity when we know the width $w_i$ and the height $h_i$ of every
rectangle. Hence

$$Sel(\mathcal{R}, q) = \sum_{i=1}^{N} (q_x + w_i)(q_y + h_i), \qquad (1)$$

which can also be written

$$Sel(\mathcal{R}, q) = A + q_x \cdot H + q_y \cdot W + q_x \cdot q_y \cdot N \qquad (2)$$

where $W$ and $H$ are the total width and height extent of $\mathcal{R}$. The question is how to estimate the selectivity
with much less information.
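
The equivalence of (1) and (2) can be checked numerically. The following is a minimal sketch, not part of the original analysis: the function names and the three sample rectangles are made up for illustration, and all sizes are assumed to be normalized to the unit square.

```python
def selectivity_exact(rects, qx, qy):
    """Formula (1): sum of (qx + w_i) * (qy + h_i) over all rectangles."""
    return sum((qx + w) * (qy + h) for w, h in rects)


def selectivity_from_totals(A, W, H, N, qx, qy):
    """Formula (2): the same quantity from the four aggregates A, W, H, N."""
    return A + qx * H + qy * W + qx * qy * N


# Hypothetical (width, height) pairs, chosen only for illustration.
rects = [(0.10, 0.05), (0.02, 0.01), (0.04, 0.02)]
A = sum(w * h for w, h in rects)   # total area
W = sum(w for w, _ in rects)       # total width
H = sum(h for _, h in rects)       # total height

s1 = selectivity_exact(rects, 0.1, 0.2)
s2 = selectivity_from_totals(A, W, H, len(rects), 0.1, 0.2)
```

Both values agree up to floating-point rounding, which is why only the aggregates $A$, $W$, $H$ and $N$ need to be estimated in what follows.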
3.2 Naive solution


The major problem is the following: what assumption should we make about the distribution of rectangle sizes?
Is it Gaussian? Is it bimodal? Clearly, the most straightforward assumption is that sizes
are uniform. In this case, the rectangles being similar with aspect ratio $p$, we can merely conclude that the
expected width of a rectangle in $\mathcal{R}$ is $\sqrt{p \cdot A / N}$, while its expected height is $\sqrt{A / (p \cdot N)}$. Therefore, since

$$W = \sqrt{p \cdot A \cdot N} \quad \text{and} \quad H = \sqrt{\frac{A \cdot N}{p}},$$

it follows from (2) that the expected selectivity is

$$Sel(\mathcal{R}, q) = A + \left(\frac{q_x}{\sqrt{p}} + q_y \cdot \sqrt{p}\right) \cdot \sqrt{A \cdot N} + q_x \cdot q_y \cdot N. \qquad (3)$$
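
As an illustration, the uniformity estimate (3) can be written as a short function. This is a sketch under the assumptions stated in the text (all rectangles share the same area $A/N$ and aspect ratio $p$); the function name is hypothetical.

```python
import math


def selectivity_uniform(A, N, p, qx, qy):
    """Formula (3): selectivity when all N rectangles have area A / N."""
    W = math.sqrt(p * A * N)   # total width under uniformity
    H = math.sqrt(A * N / p)   # total height under uniformity
    return A + qx * H + qy * W + qx * qy * N


# Four identical 0.2 x 0.1 rectangles: p = 2, A = 0.08.
estimate = selectivity_uniform(0.08, 4, 2.0, 0.1, 0.1)
```

For a dataset that really consists of identical rectangles the estimate coincides with formula (1); the error discussed in Section 4 comes from real datasets not being of this kind.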

3.3  Proposed  solution:  the  REGAL  law 

However, real region datasets do not obey the uniformity assumption. Rather, it turns out that the
complementary cumulative distribution function^1 (CCDF) of the areas of the regions obeys the following
hyperbolic power law:

REGion Area Law (REGAL): The number of regions $C(a)$ of area greater than or equal to $a$
follows the hyperbolic power law

$$C(a) = K \cdot a^{-B}, \quad B > 0, \; K > 0. \qquad (4)$$

Korcak was the first to observe such a law, for the Aegean Islands (he suggested $B \approx 0.5$) [12]. The
exponent $B$ is also called the patchiness exponent. Recent measurements on 2-d region datasets from
diverse applications suggest that usually a similar power law holds [10], with $B$ in the range [0.5, 0.9]. In
Section 5, we show that the power law (4) is related to fractals. Given that fractals appear surprisingly
often in nature, we expect that the majority of real region datasets will obey (4). Moreover, as a
consequence of the inherent self-similarity of real region datasets, the minimum bounding rectangles
(MBRs) of the regions are expected to follow the same law as well.

Under the realistic assumption that rectangles in $\mathcal{R}$ obey (4), we now show that we can compute
much more accurate estimates of the selectivity if we are given the patchiness exponent $B$. Notice that
the uniformity assumption is unable to use this extra information. We prove the following:

^1 Remember that the cumulative distribution function of $f(x): \Re \to \Re$ is defined as $F(x) = \int_{-\infty}^{x} f(t)\,dt$, while the
complementary cumulative distribution function is defined as $\bar{F}(x) = \int_{x}^{\infty} f(t)\,dt$.
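Theorem 1 Given a set $\mathcal{R} = \{r_1, r_2, \ldots, r_N\}$ of rectangles embedded in $U$ whose areas obey the
REGAL law, having a fixed given aspect ratio $p$ between width and height, a total area $A$ and a patchiness
exponent $B$, the selectivity of a rectangular window query $q$ is

$$Sel(\mathcal{R}, q) = A + \left(\frac{q_x}{\sqrt{p}} + q_y \cdot \sqrt{p}\right) \cdot \sqrt{A \cdot \frac{1 - \frac{1}{B}}{N^{1-\frac{1}{B}} - 1}} \cdot \frac{N^{1-\frac{1}{2B}} - 1}{1 - \frac{1}{2B}} + q_x \cdot q_y \cdot N. \qquad (5)$$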

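Proof. We start with (2). We need to estimate the sum of widths $W$ and the sum of heights $H$. By
assumption, $C(a)$ obeys the REGAL law (4). Hence, from the initial condition $1 = C(a_{max}) = K \cdot a_{max}^{-B}$,
where $a_{max}$ is the (unknown) area of the largest rectangle, it follows that

$$C(a) = a_{max}^{B} \cdot a^{-B}.$$

We need to estimate $a_{max}$. From the inverse relation, we have

$$a(C) = a_{max} \cdot C^{-\frac{1}{B}}.$$

Therefore, if $a_i$ denotes the area of the $i$-th rectangle of $\mathcal{R}$, it follows

$$A = \sum_{i=1}^{N} a_i \approx a_{max} \int_{1}^{N} C^{-\frac{1}{B}} \, dC = a_{max} \cdot \frac{N^{1-\frac{1}{B}} - 1}{1 - \frac{1}{B}}$$

from which it follows

$$a_{max} = A \cdot \frac{1 - \frac{1}{B}}{N^{1-\frac{1}{B}} - 1}. \qquad (6)$$

Let now $C_W(w)$ and $C_H(h)$ denote the number of regions in $\mathcal{R}$ having width and height at least $w$
and $h$, respectively. Since the rectangles in $\mathcal{R}$ are similar, we have $a = w^2 / p$ and $a = p \cdot h^2$. Then, from
(4), $C_W(w)$ obeys the following power law

$$C_W(w) = (p \cdot a_{max})^{B} \cdot w^{-2B} \qquad (7)$$

and analogously for $C_H(h)$

$$C_H(h) = \left(\frac{a_{max}}{p}\right)^{B} \cdot h^{-2B}. \qquad (8)$$

Denoting with

$$w_{max} = \sqrt{p \cdot a_{max}} = \sqrt{p \cdot A \cdot \frac{1 - \frac{1}{B}}{N^{1-\frac{1}{B}} - 1}} \qquad (9)$$

the width of the largest rectangle, and with

$$h_{max} = \sqrt{\frac{a_{max}}{p}} = \sqrt{\frac{A}{p} \cdot \frac{1 - \frac{1}{B}}{N^{1-\frac{1}{B}} - 1}} \qquad (10)$$

its height, (7, 8) can be written

$$C_W(w) = w_{max}^{2B} \cdot w^{-2B}$$

and

$$C_H(h) = h_{max}^{2B} \cdot h^{-2B}.$$

Hence, from the inverse relations we have

$$w(C_W) = \left(\frac{w_{max}^{2B}}{C_W}\right)^{\frac{1}{2B}} = w_{max} \cdot C_W^{-\frac{1}{2B}}$$

and

$$h(C_H) = \left(\frac{h_{max}^{2B}}{C_H}\right)^{\frac{1}{2B}} = h_{max} \cdot C_H^{-\frac{1}{2B}}.$$

Therefore, if $w_i$ ($h_i$) denotes the width (height) of the $i$-th rectangle of $\mathcal{R}$, it follows

$$W = \sum_{i=1}^{N} w_i \approx w_{max} \int_{1}^{N} C_W^{-\frac{1}{2B}} \, dC_W = w_{max} \cdot \frac{N^{1-\frac{1}{2B}} - 1}{1 - \frac{1}{2B}} \qquad (11)$$

and similarly

$$H = \sum_{i=1}^{N} h_i \approx h_{max} \int_{1}^{N} C_H^{-\frac{1}{2B}} \, dC_H = h_{max} \cdot \frac{N^{1-\frac{1}{2B}} - 1}{1 - \frac{1}{2B}}. \qquad (12)$$

Therefore, from (2, 9, 10, 11, 12) the thesis follows. □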


Observation 1: Notice that for $B = 1$, by applying De l'Hôpital's rule, (6) still holds, and more
precisely we have:

$$a_{max} = \lim_{B \to 1} A \cdot \frac{1 - \frac{1}{B}}{N^{1-\frac{1}{B}} - 1} = \frac{A}{\ln N}.$$

Similarly, for $B = 1/2$, (11, 12) still hold, and we have:

$$W = \lim_{B \to 1/2} w_{max} \cdot \frac{N^{1-\frac{1}{2B}} - 1}{1 - \frac{1}{2B}} = w_{max} \cdot \ln N,$$

and analogously, $H = h_{max} \cdot \ln N$.
Observation 2: Equation (6) establishes a relationship between $a_{max}$ and $A$, once $N$ is given. This
means that we can provide an accurate estimation of $Sel(\mathcal{R}, q)$ even if $a_{max}$ is given instead of $A$. Notice
that in this scenario, using the uniformity assumption, we can merely conclude that $A = a_{max} \cdot N$, which
most likely will heavily overestimate $Sel(\mathcal{R}, q)$.

As we show next, the above theorem provides a good estimation for window selectivity on real
region datasets.
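
The estimate of Theorem 1 can be sketched in a few lines by chaining equations (6), (9), (10), (11), (12) and (2). The sketch below is illustrative only: the function names are hypothetical, the inputs are assumed to be normalized to the unit square, and $N > 1$ is assumed. The helper handles the limit cases of Observation 1 by returning $\ln N$ when the exponent equals 1.

```python
import math


def _power_sum(N, e):
    """Integral of C**(-e) for C from 1 to N; equals ln N in the limit e = 1."""
    if abs(1.0 - e) < 1e-9:
        return math.log(N)
    return (N ** (1.0 - e) - 1.0) / (1.0 - e)


def selectivity_regal(A, N, p, B, qx, qy):
    """Selectivity estimate of Theorem 1 for a qx x qy window (requires N > 1)."""
    a_max = A / _power_sum(N, 1.0 / B)           # equation (6)
    w_max = math.sqrt(p * a_max)                 # equation (9)
    h_max = math.sqrt(a_max / p)                 # equation (10)
    W = w_max * _power_sum(N, 1.0 / (2.0 * B))   # equation (11)
    H = h_max * _power_sum(N, 1.0 / (2.0 * B))   # equation (12)
    return A + qx * H + qy * W + qx * qy * N     # equation (2)
```

Only four numbers describe the dataset ($A$, $N$, $p$, $B$), so the estimate costs constant time regardless of the dataset size.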

3.4  Generalization  to  the  d-dimensional  space 

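The above result can be extended to the d-dimensional space. In such a case, the problem is as follows:
we are given a set $\mathcal{R} = \{r_1, r_2, \ldots, r_N\}$ of d-dimensional similar (i.e., having a fixed given aspect ratio
$p_{ij}$ between the i-th and the j-th side) hyper-rectangles embedded in $U = [0,1]^d$, their total volume
$V$ and their patchiness exponent $B$. Let the CCDF of the volumes of the hyper-rectangles follow the
power law

$$C(v) = v_{max}^{B} \cdot v^{-B}.$$

Then, the selectivity of a rectangular window query $q = (q_1, \ldots, q_d)$ is

$$Sel(\mathcal{R}, q) = \sum_{\{i_1, \ldots, i_k\} \in 2^{\{1, \ldots, d\}}} \left( \prod_{j=1}^{k} x_{i_j} \right) \cdot \frac{N^{1-\frac{k}{d \cdot B}} - 1}{1 - \frac{k}{d \cdot B}} \cdot \prod_{i \notin \{i_1, \ldots, i_k\}} q_i$$

where $2^{\{1, \ldots, d\}}$ denotes the power set^2 of $\{1, \ldots, d\}$ and

$$x_i = \sqrt[d]{\prod_{j=1}^{d} p_{ij} \cdot V \cdot \frac{1 - \frac{1}{B}}{N^{1-\frac{1}{B}} - 1}}$$

is the i-th side of the largest object having volume $v_{max}$. In this sum, the term for the full set
$\{1, \ldots, d\}$ equals $V$, and the term for the empty set is the count term $N \cdot \prod_{i=1}^{d} q_i$ of (2) (the integral
approximation yields $N - 1$ in place of $N$).

The above formula can be proved by analogy with the 2-dimensional case and the proof is omitted here.

The above expression can be simplified when working on a set of hyper-cubic objects. In such a case,
denoting by $x_{max}$ the side of the largest region, we have

$$Sel(\mathcal{R}, q) = V + \sum_{\{i_1, \ldots, i_k\} \in 2^{\{1, \ldots, d\}} \setminus \{\emptyset, \{1, \ldots, d\}\}} x_{max}^{k} \cdot \frac{N^{1-\frac{k}{d \cdot B}} - 1}{1 - \frac{k}{d \cdot B}} \cdot \prod_{i \notin \{i_1, \ldots, i_k\}} q_i + N \cdot \prod_{i=1}^{d} q_i$$

where

$$x_{max} = \sqrt[d]{v_{max}} = \sqrt[d]{V \cdot \frac{1 - \frac{1}{B}}{N^{1-\frac{1}{B}} - 1}}.$$

The above expression, for square window queries of side $q$, further simplifies to

$$Sel(\mathcal{R}, q) = V + \sum_{j=1}^{d-1} \binom{d}{j} \cdot q^{d-j} \cdot x_{max}^{j} \cdot \frac{N^{1-\frac{j}{d \cdot B}} - 1}{1 - \frac{j}{d \cdot B}} + q^{d} \cdot N.$$

^2 Remember that, given a set $S$, the power set of $S$, denoted as $2^S$, is defined as the set of all subsets of $S$, including the
empty set and $S$ itself. If $S$ is finite, the cardinality of $2^S$ is $2^{|S|}$.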


4  Experiments  on  real  datasets 


To assess experimentally the accuracy of our analysis, we have used three different region datasets,
namely:


• The Scandinavian Lakes (LAKES), available at http://mapweb.parc.xerox.com/map/nogrid
(Xerox PARC Map Viewer) and consisting of 810 lakes.

• The Indonesia Archipelago (ISLANDS), available at http://mapweb.parc.xerox.com/map/nogrid,
and consisting of 470 islands.

• A population density map of Europe (REGIONS). This map has been created starting from a
population density map from a World Atlas. Each grid cell is turned to black if it has density
above a threshold, namely 30 inhabitants/km^2. It consists of 757 regions.

We  also  used  three  additional  datasets:  the  Aegean  Islands  (51  islands)  and  the  Japan  Archipelago 
(186  islands),  both  available  at  http://mapweb.parc.xerox.com/map/nogrid,  and  a  map  of  Italy 
agricultural  plains  (228  regions),  created  starting  from  a  geographic  map  from  a  World  Atlas  and 
turning  to  black  a  grid  cell  whenever  it  is  at  most  50  meters  above  the  sea  level.  We  do  not  give  details 
about  these  datasets  since  the  results  were  similar. 

In the following subsections we present results for: (a) verifying that the MBRs of the regions obey
the REGAL law (4); (b) verifying the accuracy of our formula (5) as compared to the formula derived
using a uniformity assumption (3).

4.1  Verifying  the  REGAL  law 

All the datasets were stored using 1024 x 1024 bitmaps, as shown in Figure 1a-c. As a preliminary step, we
identified all the regions and their MBRs in each dataset. Then, we computed all the relevant
features needed for checking our results. These data are summarized in Table 2. Note that to estimate
$B$, we computed the CCDF of the MBR areas for each dataset and interpolated the plotted
points with a straight line using the classic least-squares method. Note also that $p$ has been computed
by averaging over all the MBRs' aspect ratios.

Dataset    N     A         B      p
LAKES      810   75,910    0.85   1.13
ISLANDS    470   136,893   0.60   1.98
REGIONS    757   190,526   0.70   0.53

Table 2: Datasets features.

Figure 1d-f shows in a log-log diagram the obtained results, along with the regression line, whose
negated slope is the patchiness exponent $B$. It is impressive that the MBRs of all three datasets, even
if their characteristics are so different, obey (4) almost perfectly.
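
The fitting procedure described above (CCDF of the areas, log-log plot, least-squares line, negated slope) can be sketched as follows. This is a minimal illustration in plain Python, not the code used for the experiments; the function name is hypothetical, and the areas are assumed to be positive and distinct so that the rank of an area equals $C(a)$.

```python
import math


def patchiness_exponent(areas):
    """Negated least-squares slope of log C(a) versus log a."""
    ordered = sorted(areas, reverse=True)
    # The i-th largest area (1-based rank i) has C(a) = i regions at least as large.
    xs = [math.log(a) for a in ordered]
    ys = [math.log(rank) for rank in range(1, len(ordered) + 1)]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return -slope


# Synthetic areas that follow (4) exactly with B = 0.7 (illustrative values only).
synthetic = [1000.0 * i ** (-1.0 / 0.7) for i in range(1, 201)]
B_hat = patchiness_exponent(synthetic)
```

On data that follow the power law exactly the fit recovers $B$; on real data the quality of the fit is what Figure 1d-f displays.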

4.2  Accuracy  of  our  selectivity  estimation 

To ascertain the accuracy of our formula (5) as compared to the formula derived using a uniformity
assumption (3), for each dataset we initially computed the real selectivity^3 using (1). Then, we
applied (3, 5), for query windows of width $2^i$, $i = 0, \ldots, 10$, and having three different aspect ratios: 1:1
(square), 1:2 and 2:1. Figure 2 shows the relative error of our approach, as compared to that of the
uniformity model, for the LAKES, ISLANDS and REGIONS datasets, respectively. Note that, for each
dataset, our approach is usually within 1% of the real value, and never exceeds a 5% relative error.
On the other hand, the uniformity model can give up to 70% relative error. Note also that the ratio
between the relative error of the uniformity model and that of the proposed model is enormous: in particular,
for 1:2 window queries on the REGIONS dataset, it is 44 on average (i.e., the proposed model is
44 times more precise!). Finally, following the recommendations from statistics, we have also computed
the geometric average of relative errors, for each dataset and for each different window aspect ratio,
summarized in Table 3. Even in this case, the ratio between the two models is huge: in particular, for
1:2 window queries on the REGIONS dataset, it has a peak of 30.
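
The two error measures used here are simple to state in code. The sketch below is illustrative (function names are hypothetical); the last line reproduces the ratio quoted for 1:2 window queries on REGIONS from the two Table 3 entries 15.44 and 0.51.

```python
import math


def relative_error_pct(estimate, actual):
    """Relative error of an estimate, in percent."""
    return 100.0 * abs(estimate - actual) / actual


def geometric_mean(values):
    """Geometric average of strictly positive values."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Ratio between the UNIF and REGAL geometric averages for REGIONS, 1:2 windows.
ratio = 15.44 / 0.51   # about 30
```

The geometric average is less dominated by a few large errors than the arithmetic one, which is why the two reported ratios (44 and 30) differ.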

5  Discussion 

In this section, we first discuss the relationship between the patchiness exponent and the fractal
dimension, and we exploit it to provide a realistic random region generator and a fast estimation of $B$.
Then, we suggest directions for a practitioner to fully exploit our method.

^3 Of course, all the computations have been normalized to the 1024 x 1024 image space.
Figure 1: Datasets used: (a) LAKES; (b) ISLANDS; (c) REGIONS, together with their patchiness
plots, log(count) vs. log(area), for (d) LAKES; (e) ISLANDS; (f) REGIONS.


5.1  Fractals,  patchiness  and  random  region  generators 

Power laws go hand-in-hand with self-similarity and fractals [17] and (4) is no exception. Let's see why
this is the case, and how we can use fractals to our advantage. First, a quick introduction to fractals is
necessary.

Preliminaries: A fractal is a set of points that are exactly or statistically self-similar. Exact fractals
are generated by a "generator", applied recursively to an "initiator". Figure 3 gives an
example of a famous fractal, the "Koch snowflake". Figure 3a gives a (unit) line segment (the initiator);
Figure 3b gives the generator, consisting of $N_r = 4$ smaller segments, each of size $r = 1/3$. Figure 3c
shows the replacement of each of the segments of 3b with a miniature replica of the generator.
Repeating the process to infinity, we obtain the "Koch curve". Notice that it has infinite length, that is
$\lim_{n \to \infty} (4/3)^n$. Gluing 3 such curves together, we obtain the Koch snowflake (Figure 3d).

The (Hausdorff) fractal dimension $D_H$ for a strictly self-similar fractal is defined as

$$D_H = \frac{\log N_r}{\log (1/r)} \qquad (13)$$

and gives a measure of the "roughness" of the fractal. For a straight line, we have $D_H = 1$; for the Koch
snowflake, we have $D_H = \log 4 / \log 3$, slightly higher than 1, that is, it is more rugged than a straight
line. One of the most rugged curves is the Hilbert curve, with fractal dimension $D_H = 2$ [14], hence it
is a space-filling curve [7].

                 1:1               1:2               2:1
Dataset     REGAL    UNIF     REGAL    UNIF     REGAL    UNIF
LAKES       0.92     8.30     1.08     8.63     0.82     8.43
ISLANDS     0.58     15.66    1.77     14.52    0.69     17.68
REGIONS     0.78     12.63    0.51     15.44    1.85     10.70

Table 3: Geometric average relative error (%) in estimating $Sel(\mathcal{R}, q)$ of the proposed method (REGAL)
as compared to the uniform model (UNIF), for each dataset and for each aspect ratio of the query
window.
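
Definition (13) is easy to evaluate for the examples mentioned above. The following sketch uses a hypothetical function name; the generator parameters for the straight line (3 pieces of size 1/3) and for the Hilbert curve (4 pieces of size 1/2) are standard choices supplied here for illustration.

```python
import math


def similarity_dimension(n_pieces, ratio):
    """Equation (13): log N_r / log(1 / r) for a strictly self-similar fractal."""
    return math.log(n_pieces) / math.log(1.0 / ratio)


koch = similarity_dimension(4, 1 / 3)      # Koch curve, about 1.26
line = similarity_dimension(3, 1 / 3)      # straight line split in 3 pieces
hilbert = similarity_dimension(4, 1 / 2)   # space-filling curve
```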


σ-fractals: However, fractals like the Koch snowflake consist of a single region. For generating multiple
regions, we make use of the so-called σ-fractals. Figure 4 gives a possible σ-fractal generator, together
with the resulting regions after 2 iterations on the sides of the square with side $s_{max}$.
It turns out that the following theorem holds:

Theorem 2 (Mandelbrot) For a σ-fractal in a d-dimensional space, we have

$$B = \frac{D_H}{d}$$

where $B$ is the patchiness exponent of the regions (d-dimensional volumes) and $D_H$ is the fractal
dimension of their boundaries.

Proof. See [14]. □


Given  the  inherent  self-similarity  in  real  datasets,  the  above  relationship  holds  for  real  datasets  too. 
Our  experiments  on  diverse  region  datasets  as  well  as  previous  studies  [10,  14]  confirm  that  the  law 
strongly  holds  for  lakes,  archipelagoes,  vegetative  ecosystems,  urban  areas  and  many  others,  as  shown 
in  Table  4. 
Figure 2: LAKES, ISLANDS and REGIONS (from top to bottom) datasets: percent relative error vs
query window width, for square, 1:2 and 2:1 window queries (from left to right). Proposed method
("x") and uniformity model ("O").


In conclusion, the theory of σ-fractals is extremely suitable for the study of real region datasets.

The reasons are the following:

1. It leads to regions that obey the power law (4) with some patchiness exponent. As we illustrated,
several diverse real datasets obey this law.

2. It provides an easy, recursive algorithm to generate self-similar, realistic region datasets. All we
have to do is to choose some values of $N_r$ and $r$, such that $\log N_r / \log(1/r) = B \cdot d$, choose a generator
with that $N_r$ and $r$, and apply it recursively as many times as needed.

3. It provides useful theorems (like, for instance, Theorem 2) which link the patchiness exponent $B$
with the fractal dimension $D_H$ of the boundary of a set of regions. This is important, because
we can tap the literature of fractals, where the fractal dimension of several datasets is mentioned
(e.g., see appendix in [14, 17]) to obtain an accurate estimation of $B$.
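
Choosing the generator parameters in point 2 amounts to solving $\log N_r / \log(1/r) = B \cdot d$ for an integer $N_r$. The sketch below is an illustration with hypothetical function names; it only picks the parameters and does not implement the recursive generator itself. The values $N_r = 12$ and $r = 1/8$ are those of the generator in Figure 4.

```python
import math


def generator_piece_count(B, r, d=2):
    """Nearest integer N_r such that log N_r / log(1 / r) is close to B * d."""
    return round((1.0 / r) ** (B * d))


def implied_patchiness(n_r, r, d=2):
    """Patchiness exponent B = D_H / d implied by a generator (N_r, r)."""
    return math.log(n_r) / math.log(1.0 / r) / d


n_r = generator_piece_count(0.6, 1 / 8)   # target B = 0.6 in the plane
b_fig4 = implied_patchiness(12, 1 / 8)    # about 0.6
```

Because $N_r$ must be an integer, the achieved exponent only approximates the target; a smaller $r$ gives a finer choice of attainable values.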
Figure 3: Koch snowflake: (a) initiator; (b) generator; (c) second iteration; (d) resulting Koch snowflake.

Figure 4: The region generator (left), with $N_r = 12$, $r = 1/8$ and $D_H = \log 12 / \log 8 = 1.19$, and the synthetic
dataset after two steps of generation (right).

5.2  Directions  for  a  practitioner:  fast  estimation  of  B 

The final question is: how can a practitioner benefit from our analysis? We have solid answers to this
question. Up to now, assuming safely that $N$ and $A$ are given in advance, estimating the selectivity
for range queries with the existing formulas [11, 15] required computing the total width
and height extent of the MBRs of the objects. This can be done in two ways: either by applying image
processing algorithms to the bitmap representation to extract the regions and compute their MBRs,
or, alternatively, by scanning the entire database storing the data. Both approaches are time-consuming.

On the contrary, the patchiness exponent $B$ and the ratio $p$ can be computed quickly. Concerning
$p$, a robust solution is to average the aspect ratios over a small number of regions. Concerning $B$, we
suggest two possible fast ways to compute it, both of them based on sampling. The first makes use of
the bitmap representation, while the second works on the database storing the data:

1. Focus on a subwindow of size $t \times t$ of the bitmap, extract the boundaries of the objects contained
in it and apply the $O(t \log t)$ time algorithm [1] to compute their fractal dimension $D_H$. Assuming
that regions are self-similar (and then subwindows of the bitmap are similar to the whole) and
applying Theorem 2, we can conclude that $B = D_H / 2$ is a good approximation for the real $B$ of
the whole map.

2. Focus on a subwindow of the bitmap, retrieve from the database all the objects contained in it
and compute the CCDF of their areas. Then, plot the obtained points in a log-log diagram and
interpolate them with a straight line using the classic least-squares method. The negated slope of
such a line corresponds to the patchiness exponent of the subset of objects. Once again, assuming
that regions are self-similar, we can be confident that such an exponent is representative of the whole
dataset.

Dataset                    D_H     B       |D_H - 2B|
LAKES                      1.78    0.85    0.08
ISLANDS                    1.23    0.60    0.03
REGIONS                    1.48    0.70    0.08
Aegean Islands             1.08    0.52    0.04
Japan Archipelago          1.19    0.59    0.01
Italy plains               1.32    0.63    0.06
-----------------------------------------------------
Whole Earth [14]           1.2     0.6     0
Cypress vegetation [10]    1.23    0.62    0.01

Table 4: Connection between $D_H$ and $B$ for real datasets: above the line, our own experiments (LAKES,
ISLANDS, REGIONS, Aegean Islands, Japan Archipelago and Italy agricultural plains). Below the line,
data drawn from [14, 10], resp.
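
For the first approach, the fractal dimension of the extracted boundary pixels can be estimated by box counting. The sketch below is a plain box-counting stand-in and not the $O(t \log t)$ algorithm of [1], whose details are not given here; the function name is hypothetical, and $t$ is assumed to be a power of two with $t \geq 4$.

```python
import math


def box_counting_dimension(pixels, t):
    """Least-squares slope of log(box count) versus log(1 / box size).

    pixels: iterable of integer (x, y) coordinates in [0, t) x [0, t);
    t: side of the subwindow, assumed to be a power of two, t >= 4.
    """
    pixels = set(pixels)
    xs, ys = [], []
    size = 1
    while size < t:
        # Number of size x size grid cells containing at least one pixel.
        boxes = {(x // size, y // size) for x, y in pixels}
        xs.append(math.log(1.0 / size))
        ys.append(math.log(len(boxes)))
        size *= 2
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
            / sum((x - mean_x) ** 2 for x in xs))


segment = [(x, 0) for x in range(64)]                    # a straight boundary
block = [(x, y) for x in range(64) for y in range(64)]   # a filled square
d_segment = box_counting_dimension(segment, 64)
d_block = box_counting_dimension(block, 64)
```

By Theorem 2 with $d = 2$, halving the estimated dimension of the boundaries gives the approximation of $B$ described in point 1.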

Finally, if neither approach is practicable, we can easily obtain an accurate lower bound on the
selectivity by setting $B = 0.5$ and an accurate upper bound by setting $B = 0.9$, since $B$ is experimentally
known to range over the interval [0.5, 0.9].

Therefore  we  conclude  that  our  analysis  is  suitable  in  practice  and  contributes  to  the  solution  to  the 
problem  of  query  performance  evaluation  in  real  spatial  databases. 

6  Conclusions 

The main contribution of this paper is the accurate modeling of real region datasets, such as
archipelagoes, areas of vegetation, city regions, plain maps, hydro-graphic systems and many others. We showed
that very few measures are needed (the total count of objects, the total volume, the average aspect
ratios among the sides of an object and the patchiness exponent), to achieve extremely accurate results.
Our experiments on diverse, real datasets, showed that our approach achieves selectivity estimates
within 1-5% for the maximum relative error, against 30-70% of a naive model that uses the uniformity
assumption.

We also pinpointed the connection between the patchiness exponent $B$ and fractals, and specifically
the σ-fractals. The immediate benefits are (a) a fast method to estimate $B$ and (b) a simple method
to generate realistic region data.

Promising future directions include the use of σ-fractals to study selectivities of additional query
types (nearest neighbor etc.) and to analyze SAMs on real region data.